function: Allow more expressive array signatures #14532
b7fc773
to
d4b74db
Compare
This is failing CI for the following reason. Previously, in So for example, if we had the type This PR modifies the logic of There's a test, I don't understand the old behavior, and why it was different for different function signatures. So I'm having trouble figuring out what the new behavior should be. We could of course sniff out the functions arguments to see if we have |
See #13819 (comment).
Array {
    /// A full list of the arguments accepted by this function.
    arguments: Vec<ArrayFunctionArgument>,
},
One drawback of this structure is that someone can technically create a signature with no array, only for it to get rejected at runtime in get_valid_types(). An alternative structure could be:
/// A list of arguments that come before the array.
pre_array_arguments: Vec<ArrayFunctionArgument>,
/// A list of arguments that come after the array.
post_array_arguments: Vec<ArrayFunctionArgument>,
Then we would remove the ArrayFunctionArgument::Array variant. This would have the drawback of only allowing a single array argument in the signature, which might be fine.
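A minimal sketch of that alternative layout, using stand-in types rather than DataFusion's real definitions. `NonArrayArgument`, `Slot`, `SingleArraySignature`, and `flatten` are hypothetical names introduced only for illustration. Because the array's position is implied by the two lists, a signature without an array cannot be built at all:

```rust
/// Stand-in for the non-array argument kinds (hypothetical: there is
/// deliberately no `Array` variant in this layout).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NonArrayArgument {
    Element,
    Index,
}

/// One position in the flattened argument list (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Slot {
    Array,
    Other(NonArrayArgument),
}

/// Hypothetical signature shape: exactly one array, implied by position.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SingleArraySignature {
    /// Arguments that come before the array.
    pub pre_array_arguments: Vec<NonArrayArgument>,
    /// Arguments that come after the array.
    pub post_array_arguments: Vec<NonArrayArgument>,
}

impl SingleArraySignature {
    /// Expands the two lists into the full argument order:
    /// pre-array arguments, the array itself, then post-array arguments.
    pub fn flatten(&self) -> Vec<Slot> {
        self.pre_array_arguments
            .iter()
            .copied()
            .map(Slot::Other)
            .chain(std::iter::once(Slot::Array))
            .chain(self.post_array_arguments.iter().copied().map(Slot::Other))
            .collect()
    }
}
```

Even the default (empty) value flattens to a single array argument, which is the invariant the comment is after. The cost, as noted, is that a second array argument cannot be expressed.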
Another option is to force people to create this through a constructor that returns an error if there's no array argument.
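A rough sketch of the constructor idea, assuming a wrapper with a private field so the check cannot be bypassed. `ValidatedArguments` is a hypothetical name, and the enum is a stand-in for the one in this PR, not the real definition:

```rust
/// Stand-in for the argument kinds discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrayFunctionArgument {
    Array,
    Element,
    Index,
}

/// Hypothetical wrapper: the field is private, so the only way to obtain
/// a value is through `new`, which enforces the invariant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ValidatedArguments {
    arguments: Vec<ArrayFunctionArgument>,
}

impl ValidatedArguments {
    /// Returns an error unless at least one argument is an array.
    pub fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, &'static str> {
        if arguments.contains(&ArrayFunctionArgument::Array) {
            Ok(Self { arguments })
        } else {
            Err("missing array argument")
        }
    }

    /// Read-only access to the validated argument list.
    pub fn arguments(&self) -> &[ArrayFunctionArgument] {
        &self.arguments
    }
}
```

Callers then have to handle the `Result`, typically with `expect()` when the signature is built once at startup.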
/// An Int64 index argument.
Index,
Offset might be a better name for this? It can technically be used for sizes in functions like array_resize or counts for functions like array_replace_n.
Ah ok, so the idea was that if the function changed the size of the list, then it would recursively convert So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information. |
yes |
I pushed a commit for this to see how CI would like it. Happy to revert it if people don't like it. I also pushed a commit to add some validations around |
96b95e1
to
3a7c0e6
Compare
@jayzhan211 I just pushed a commit with your idea about adding a flag for array coercion. I'm feeling pretty good about the current state of this PR, but I have the following two open questions:
6798ca2
to
9ba3fe3
Compare
It is because of this, I think we now only coerce to list if the flag is set:
fn array(array_type: &DataType) -> Option<DataType> {
    match array_type {
        DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
        DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
        _ => None,
    }
}
pub fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, &'static str> {
    if !arguments
        .iter()
        .any(|arg| *arg == ArrayFunctionArgument::Array)
Instead of checking here, I think we can verify the validity in get_valid_types, so the definition of signature can be simplified.
That's actually what I originally had. The benefit of the current approach is that the error happens during startup and is very clear. Otherwise the error doesn't happen until the function is first executed, and the error is not as informative. Additionally, it forces people to validate the invariant that ArrayFunctionSignature::Array contains at least one array when initializing the struct, because they're forced to call unwrap() or expect().
On startup:
DataFusion CLI v45.0.0
thread 'main' panicked at datafusion/expr-common/src/signature.rs:695:22:
contains array: "missing array argument"
stack backtrace:
0: rust_begin_unwind
at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/panicking.rs:665:5
1: core::panicking::panic_fmt
at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/panicking.rs:76:14
2: core::result::unwrap_failed
at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/core/src/result.rs:1699:5
3: core::result::Result<T,E>::expect
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1061:23
4: datafusion_expr_common::signature::Signature::array_and_index
at /home/joe/Projects/datafusion/datafusion/expr-common/src/signature.rs:691:32
5: datafusion_functions_nested::extract::ArrayElement::new
at /home/joe/Projects/datafusion/datafusion/functions-nested/src/extract.rs:121:24
6: datafusion_functions_nested::extract::array_element_udf::INSTANCE::{{closure}}
at /home/joe/Projects/datafusion/datafusion/functions-nested/src/macros.rs:95:29
7: core::ops::function::FnOnce::call_once
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
8: core::ops::function::FnOnce::call_once
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
9: std::sync::lazy_lock::LazyLock<T,F>::force::{{closure}}
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/lazy_lock.rs:212:25
10: std::sync::once::Once::call_once::{{closure}}
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/once.rs:158:41
11: std::sys::sync::once::futex::Once::call
at /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/std/src/sys/sync/once/futex.rs:176:21
12: std::sync::once::Once::call_once
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/once.rs:158:9
13: std::sync::lazy_lock::LazyLock<T,F>::force
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/lazy_lock.rs:208:9
14: <std::sync::lazy_lock::LazyLock<T,F> as core::ops::deref::Deref>::deref
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sync/lazy_lock.rs:311:9
15: datafusion_functions_nested::extract::array_element_udf
at /home/joe/Projects/datafusion/datafusion/functions-nested/src/macros.rs:98:39
16: datafusion_functions_nested::all_default_nested_functions
at /home/joe/Projects/datafusion/datafusion/functions-nested/src/lib.rs:129:9
17: datafusion::execution::session_state_defaults::SessionStateDefaults::default_scalar_functions
at /home/joe/Projects/datafusion/datafusion/core/src/execution/session_state_defaults.rs:108:31
18: datafusion::execution::session_state::SessionStateBuilder::with_default_features
at /home/joe/Projects/datafusion/datafusion/core/src/execution/session_state.rs:1088:36
19: datafusion::execution::context::SessionContext::new_with_config_rt
at /home/joe/Projects/datafusion/datafusion/core/src/execution/context/mod.rs:330:21
20: datafusion_cli::main_inner::{{closure}}
at ./src/main.rs:173:15
21: datafusion_cli::main::{{closure}}
at ./src/main.rs:131:34
22: <core::pin::Pin<P> as core::future::future::Future>::poll
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
23: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:63
24: tokio::runtime::coop::with_budget
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:107:5
25: tokio::runtime::coop::budget
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:73:5
26: tokio::runtime::park::CachedParkThread::block_on
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:31
27: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/blocking.rs:66:9
28: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
29: tokio::runtime::context::runtime::enter_runtime
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/runtime.rs:65:16
30: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
31: tokio::runtime::runtime::Runtime::block_on_inner
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:370:45
32: tokio::runtime::runtime::Runtime::block_on
at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:340:13
33: datafusion_cli::main
at ./src/main.rs:131:5
34: core::ops::function::FnOnce::call_once
at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Process finished with exit code 101
During function execution:
> SELECT array_element(1);
Error during planning: Internal error: Function 'array_element' expected at least one argument array argument.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker No function matches the given name and argument types 'array_element(Int64)'. You might need to add explicit type casts.
Candidate functions:
array_element(index)
I've removed ArrayFunctionArguments, but still wanted to explain its rationale in case it wasn't obvious.
@@ -94,7 +94,7 @@ impl Default for ArrayHas {
 impl ArrayHas {
     pub fn new() -> Self {
         Self {
-            signature: Signature::array_and_element(Volatility::Immutable),
+            signature: Signature::array_and_element(Volatility::Immutable, None),
How about something like this? We don't necessarily need to wrap it in a util function; that is less helpful when there are many fields.
signature: Signature {
    type_signature: TypeSignature::ArraySignature(
        ArrayFunctionSignature::Array {
            arguments: vec![
                ArrayFunctionArgument::Array,
                ArrayFunctionArgument::Element,
            ],
            array_coercion: Some(ListCoercion::FixedSizedListToList),
        },
    ),
    volatility: Volatility::Immutable,
},
But to avoid breaking Signature::array_and_element, we can keep a version without the additional array_coercion field.
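One way this could look, sketched with stand-in types so it compiles on its own. `ArraySignature` and `with_array_coercion` are hypothetical names, and defaulting to no coercion is an assumption, not something the thread settles:

```rust
/// Stand-ins for the DataFusion types named in this thread; the real
/// definitions carry more variants and fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Volatility {
    Immutable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ListCoercion {
    FixedSizedListToList,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrayFunctionArgument {
    Array,
    Element,
}

/// Hypothetical flattened signature used only for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ArraySignature {
    pub arguments: Vec<ArrayFunctionArgument>,
    pub array_coercion: Option<ListCoercion>,
    pub volatility: Volatility,
}

impl ArraySignature {
    /// Keeps the original one-argument call shape so existing callers
    /// compile unchanged. Assumption: no coercion unless requested.
    pub fn array_and_element(volatility: Volatility) -> Self {
        Self {
            arguments: vec![
                ArrayFunctionArgument::Array,
                ArrayFunctionArgument::Element,
            ],
            array_coercion: None,
            volatility,
        }
    }

    /// Hypothetical opt-in for callers that need list coercion.
    pub fn with_array_coercion(mut self, coercion: ListCoercion) -> Self {
        self.array_coercion = Some(coercion);
        self
    }
}
```

Functions with many fields to set can still write the struct literal directly, as suggested above; the helper only covers the common case.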
I'm sorry, I don't think I fully understood what you're suggesting. I pushed d404c5e to try and address this, please let me know if that's what you had in mind.
133b120
to
b738077
Compare
Are you saying that the function should look something like this?
fn array(
    array_type: &DataType,
    array_coercion: Option<&ListCoercion>,
) -> Option<DataType> {
    match (array_type, array_coercion) {
        (
            DataType::FixedSizeList(field, _),
            Some(ListCoercion::FixedSizedListToList),
        ) => Some(DataType::List(Arc::clone(field))),
        (
            DataType::List(_)
            | DataType::LargeList(_)
            | DataType::FixedSizeList(_, _),
            _,
        ) => Some(array_type.clone()),
        _ => None,
    }
}
Doing that causes some tests in
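To make the effect of the flag concrete, here is a self-contained sketch of the same match with simplified stand-in types. This `DataType` is not Arrow's (the element type is a plain `String` instead of an `Arc<Field>`), and `coerce_array` is a hypothetical name:

```rust
/// Simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Int64,
    List(String),
    LargeList(String),
    FixedSizeList(String, i32),
}

/// Stand-in for the coercion flag discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ListCoercion {
    FixedSizedListToList,
}

/// Returns the type an array argument should be coerced to, or `None`
/// if the argument is not an array type at all.
pub fn coerce_array(
    array_type: &DataType,
    array_coercion: Option<&ListCoercion>,
) -> Option<DataType> {
    match (array_type, array_coercion) {
        // Flag set: a fixed-size list becomes a variable-size list.
        (DataType::FixedSizeList(elem, _), Some(ListCoercion::FixedSizedListToList)) => {
            Some(DataType::List(elem.clone()))
        }
        // Any other list type, or flag unset: keep the type as-is.
        (
            DataType::List(_) | DataType::LargeList(_) | DataType::FixedSizeList(_, _),
            _,
        ) => Some(array_type.clone()),
        // Not an array type.
        _ => None,
    }
}
```

With the flag, `FixedSizeList("Int64", 3)` maps to `List("Int64")`; without it, the fixed-size list passes through unchanged, which is the behavior change under discussion.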
It might because of |
The query that fails is
The error is
The reason is that datafusion/datafusion/functions-nested/src/extract.rs Lines 190 to 216 in 0a57469
We can't set Just to re-iterate, it's the existing behavior that ALL array functions coerce the outermost |
This commit allows for more expressive array function signatures. Previously, `ArrayFunctionSignature` was an enum of potential argument combinations and orders. For many array functions, none of the `ArrayFunctionSignature` variants worked, so they used `TypeSignature::VariadicAny` instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.

As an example, this commit also updates the signature of the `array_replace` family of functions to use a new expressive signature, which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of all, there's no way to describe a function that has multiple different arrays of different type or dimension. Additionally, there isn't support for functions with map arrays and recursive arrays that have more than one argument.

Works towards resolving apache#14451
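As an illustration of what a descriptive signature buys, the sketch below resolves expected argument types from a list of argument kinds, so a mismatched call is rejected before the function body runs. It is a simplification under stated assumptions: `Ty` and `expected_types` are hypothetical stand-ins, and every `Element` is simply mapped to the first array's element type, with no common-type resolution or nested-list handling:

```rust
/// Simplified stand-in for a data type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Ty {
    Int64,
    Utf8,
    List(Box<Ty>),
}

/// Stand-in for the argument kinds discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrayFunctionArgument {
    Array,
    Element,
    Index,
}

/// Returns the type each argument should have, or `None` when the call
/// does not match the signature (wrong arity, or no list where an array
/// is expected).
pub fn expected_types(
    signature: &[ArrayFunctionArgument],
    actual: &[Ty],
) -> Option<Vec<Ty>> {
    if signature.len() != actual.len() {
        return None;
    }
    // The first array argument determines the element type.
    let element = signature
        .iter()
        .zip(actual)
        .find_map(|(arg, ty)| match (arg, ty) {
            (ArrayFunctionArgument::Array, Ty::List(inner)) => Some((**inner).clone()),
            _ => None,
        })?;
    signature
        .iter()
        .zip(actual)
        .map(|(arg, ty)| match arg {
            // Arrays must already be lists; they keep their type.
            ArrayFunctionArgument::Array => match ty {
                Ty::List(_) => Some(ty.clone()),
                _ => None,
            },
            // Elements take the array's element type.
            ArrayFunctionArgument::Element => Some(element.clone()),
            // Indexes are always Int64.
            ArrayFunctionArgument::Index => Some(Ty::Int64),
        })
        .collect()
}
```

Under this sketch, a three-argument replace-style function could be described as `[Array, Element, Element]`, and a call whose first argument is not a list yields `None` instead of reaching code that would have to check types by hand.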
d404c5e
to
81e3b52
Compare
Thanks for the review @jayzhan211 and for all of your helpful feedback!
I think this might also be an API change? I don't have the permissions to add the tag though.
Thanks @jkosh44
This commit allows for more expressive array function signatures. Previously, ArrayFunctionSignature was an enum of potential argument combinations and orders. For many array functions, none of the ArrayFunctionSignature variants work, so they use TypeSignature::VariadicAny instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.
As an example, this commit also updates the signature of the array_replace family of functions to use a new expressive signature, which removes a panic that existed previously.
Works towards resolving #14451
Which issue does this PR close?
Works towards closing, but doesn't fully close, #14451
Are these changes tested?
Yes
Are there any user-facing changes?
No, other than removing some panics.